Set Balancing
   HOME

TheInfoList



OR:

The set balancing problem in mathematics is the problem of dividing a set to two subsets that have roughly the same characteristics. It arises naturally in
design of experiments The design of experiments (DOE, DOX, or experimental design) is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variation. The term is generally associ ...
. There is a group of subjects. Each subject has several features, which are considered binary. For example: each subject can be either young or old; either black or white; either tall or short; etc. The goal is to divide the subjects to two sub-groups: treatment group (T) and
control group In the design of experiments, hypotheses are applied to experimental units in a treatment group. In comparative experiments, members of a control group receive a standard treatment, a placebo, or no treatment at all. There may be more than one tr ...
(C), such that for each feature, the number of subjects that have this feature in T is roughly equal to the number of subjects that have this feature in C. E.g., both groups should have roughly the same number of young people, the same number of black people, the same number of tall people, etc.


Matrix representation

Formally, the set balancing problem can be described as follows. m is the number of subjects in the general population. n is the number of potential features. The subjects are described by A, an n\times m matrix with entries in . Each column represents a subject and each row represents a feature. a_=1 if subject j has feature i, and a_=0 if subject j does not have feature i. The partition to groups is described by b, an m\times 1 vector with entries in . b_j=1 if subject j is in the treatment group T and b_j=-1 is subject j is in the control group C. The balance of features is described by c = A\cdot b. This is an n\times 1 vector. The numeric value of c_i is the imbalance in feature i: if c_i>0 then there are more subjects with i in T and if c_i<0 then there are more subjects with i in C. The ''imbalance'' of a given partition is defined as: ::I(b) = , , A\cdot b, , _\infty = \max_, c_i, The set balancing problem is to find a vector b which minimizes the imbalance I(b).


Randomized algorithm

An approximate solution can be found with the following very simple
randomized algorithm A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performan ...
: ::Send each subject to the treatment group with probability 1/2. In matrix formulation: ::Choose the elements of b randomly with probability 1/2 to each value in . Surprisingly, although this algorithm completely ignores the matrix A, it achieves a small imbalance
with high probability In mathematics, an event that occurs with high probability (often shortened to w.h.p. or WHP) is one whose probability depends on a certain number ''n'' and goes to 1 as ''n'' goes to infinity, i.e. the probability of the event occurring can be mad ...
when there are many features. Formally, for a random vector b: ::Prob\left I(b) \geq \sqrt\right\leq \frac PROOF: Let k_i be the total number of subjects that have feature i (equivalently, the number of ones in the i-th of the matrix A). Consider the following two cases: Easy case: k_i \leq \sqrt. Then, with probability 1, the imbalance in feature i (that we marked by c_i) is at most \sqrt. Hard case: k_i > \sqrt. For every j, let X_j = a_b_j. Each such X_j is a random variable that can be either 1 or -1 with probability 1/2. The imbalance in feature i is: c_i = \sum_^m. Since the X_j are independent random variables, by the
Chernoff bound In probability theory, the Chernoff bound gives exponentially decreasing bounds on tail distributions of sums of independent random variables. Despite being named after Herman Chernoff, the author of the paper it first appeared in, the result is d ...
, for every a>0: ::Prob\left c_i, \geq_\sqrt\right\leq_\frac By_the_
union_bound In probability theory, Boole's inequality, also known as the union bound, says that for any finite or countable set of events, the probability that at least one of the events happens is no greater than the sum of the probabilities of the individ ...
,_ ::Prob\left c_i, \geq_\sqrt\right\leq_\frac.


__References_

{{reflist Design_of_experiments Randomized_algorithmshtml" ;"title="c_i, \geq a\right] \leq 2 \exp(-a^2/2m) Select: a=\sqrt and get: ::Prob\left c_i, \geq \sqrt\right\leq \frac By the
union bound In probability theory, Boole's inequality, also known as the union bound, says that for any finite or countable set of events, the probability that at least one of the events happens is no greater than the sum of the probabilities of the individ ...
, ::Prob\left c_i, \geq \sqrt\right\leq \frac.


References

{{reflist Design of experiments Randomized algorithms>c_i, \geq a\right\leq 2 \exp(-a^2/2m) Select: a=\sqrt and get: ::Prob\left c_i, \geq \sqrt\right\leq \frac By the
union bound In probability theory, Boole's inequality, also known as the union bound, says that for any finite or countable set of events, the probability that at least one of the events happens is no greater than the sum of the probabilities of the individ ...
, ::Prob\left c_i, \geq \sqrt\right\leq \frac.


References

{{reflist Design of experiments Randomized algorithms